{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# A Practical Guide to Running NLP Models: BERT Use Case"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook serves as a practical guide to getting started running Natural Language Processing (NLP) models on Tenstorrent hardware devices using the TT-BUDA compiler stack. *For detailed information on model compatibility, please refer to the [models support table](../model_demos/README.md#models-support-table) to check which model works with which Tenstorrent device(s).*\n",
    "\n",
    "The tutorial will walk through an example of running the [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)) model on Tenstorrent AI accelerator hardware. The model weights will be directly downloaded from the [HuggingFace library](https://huggingface.co/docs/transformers/model_doc/bert) and executed through the PyBUDA SDK.\n",
    "\n",
    "**Note on terminology:**\n",
    "\n",
    "While TT-BUDA is the official Tenstorrent AI/ML compiler stack, PyBUDA is the Python interface for TT-BUDA. TT-BUDA is the core technology; however, PyBUDA allows users to access and utilize TT-BUDA's features directly from Python. This includes directly importing model architectures and weights from PyTorch, TensorFlow, ONNX, and TFLite."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Guide Overview\n",
    "\n",
    "In this guide, we will talk through the steps for running the BERT model trained on the [SQuADv1.1](https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/) dataset for the **Question and Answering** task.\n",
    "\n",
    "You will learn how to import the appropriate libraries, how to download model weights from popular site such as HuggingFace, utilize the PyBUDA API to initiate an inference experiment, and observe the results from running on Tenstorrent hardware."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: Import libraries\n",
    "\n",
    "Make sure that you have an activate Python environment with the latest version of PyBUDA installed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Start by importing the pybuda library and modules from HuggingFace's transformers library\n",
    "import os\n",
    "import pybuda\n",
    "from transformers import BertForQuestionAnswering, BertTokenizer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Download the model weights from HuggingFace"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load BERT tokenizer and model from HuggingFace for Q&A task\n",
    "model_ckpt = \"bert-large-cased-whole-word-masking-finetuned-squad\"\n",
    "tokenizer = BertTokenizer.from_pretrained(model_ckpt)\n",
    "model = BertForQuestionAnswering.from_pretrained(model_ckpt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3: Set example input"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load data sample from SQuADv1.1\n",
    "context = \"\"\"Super Bowl 50 was an American football game to determine the champion of the National Football League\n",
    "(NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the\n",
    "National Football Conference (NFC) champion Carolina Panthers 24\\u201310 to earn their third Super Bowl title.\n",
    "The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California.\n",
    "As this was the 50th Super Bowl, the league emphasized the \\\"golden anniversary\\\" with various gold-themed\n",
    "initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals\n",
    "(under which the game would have been known as \\\"Super Bowl L\\\"), so that the logo could prominently\n",
    "feature the Arabic numerals 50.\"\"\"\n",
    "\n",
    "question = \"Which NFL team represented the AFC at Super Bowl 50?\"\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4: Data Preprocessing\n",
    "\n",
    "Data preprocessing is an important step in the AI inference pipeline. For NLP models, we want to make sure that the input text is converted to the appropriate tokenized representation that was used to train the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Data preprocessing\n",
    "input_tokens = tokenizer(\n",
    "    question,  # pass question\n",
    "    context,  # pass context\n",
    "    max_length=384,  # set the maximum input context length\n",
    "    padding=\"max_length\",  # pad to max length for fixed input size\n",
    "    truncation=True,  # truncate to max length\n",
    "    return_tensors=\"pt\",  # return PyTorch tensor\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 5: Configure PyBUDA Parameters\n",
    "\n",
    "There are optional configurations that can be adjusted before compiling and running a model on Tenstorrent hardware. Sometimes, the configurations are necessary to compile the model and other times they are tuneable parameters that can be adjusted for performance enhancement.\n",
    "\n",
    "For the BERT model, two key parameters are required for compilation:\n",
    "\n",
    "* `default_df_override`\n",
    "* `default_dram_parameters`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set PyBuda configurations\n",
    "compiler_cfg = pybuda.config._get_global_compiler_config()\n",
    "compiler_cfg.default_df_override = pybuda._C.DataFormat.Float16_b\n",
    "compiler_cfg.default_dram_parameters = False\n",
    "compiler_cfg.balancer_policy = \"Ribbon\"\n",
    "os.environ[\"PYBUDA_RIBBON2\"] = \"1\"\n",
    "os.environ[\"TT_BACKEND_OVERLAY_MAX_EXTRA_BLOB_SIZE\"] = f\"{81*1024}\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 6: Instantiate Tenstorrent device\n",
    "\n",
    "The first time we use PyBUDA, we must initialize a `TTDevice` object which serves as the abstraction over the target hardware."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tt0 = pybuda.TTDevice(\n",
    "    name=\"tt_device_0\",  # here we can give our device any name we wish, for tracking purposes\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example, `tt0` object is created without specifying the device architecture. Pybuda will automatically detectt and define the architecture based on which Tenstorrent device its run on.\n",
    "\n",
    "### Specifying the Architecture\n",
    "If you want to specify the target device architecture, you can pass the `arch` parameter. Here’s how it can be done:\n",
    "\n",
    "```python\n",
    "# Create a TTDevice instance with a specified architecture\n",
    "tt0 = pybuda.TTDevice(\n",
    "    name=\"tt_device_0\",  # You can give your device any name for tracking purposes\n",
    "    arch=pybuda.BackendDevice.Grayskull  # Optionally set the target device architecture\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 7: Create a PyBUDA module from PyTorch model\n",
    "\n",
    "Next, we must abstract the PyTorch model loaded from HuggingFace into a `pybuda.PyTorchModule` object. This will let the BUDA compiler know which model architecture and AI framework it has to compile.\n",
    "\n",
    "We then \"place\" this module onto the previously initialized `TTDevice`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create module\n",
    "pybuda_module = pybuda.PyTorchModule(\n",
    "    name = \"pt_bert_question_answering\",  # give the module a name, this will be used for tracking purposes\n",
    "    module=model  # specify the model that is being targeted for compilation\n",
    ")\n",
    "\n",
    "# Place module on device\n",
    "tt0.place_module(module=pybuda_module)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 8: Push the (tokenized) inputs into the model input queue"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Push inputs\n",
    "tt0.push_to_inputs(input_tokens)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 9: Run inference on the targeted device\n",
    "\n",
    "Running a model on a Tenstorrent device invovles two parts: compilation and runtime.\n",
    "\n",
    "Compilation -- TT-BUDA is a compiler. Meaning that it will take a model architecture graph and compile it for the target hardware. Compilation can take anywhere from a few seconds to a few minutes, depending on the model. This only needs to happen once. When you execute the following block of code the compilation logs will be displayed.\n",
    "\n",
    "Runtime -- once the model has been compiled and loaded onto the device, the user can push new inputs which will execute immediately.\n",
    "\n",
    "The `run_inference` API can achieve both steps in a single call. If it's the first call, the model will compile. Any subsequent calls will execute runtime only.\n",
    "\n",
    "Please refer to the documentation for alternative APIs such as `initialize_pipeline` and `run_forward`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Run inference on Tenstorrent device\n",
    "output_q = pybuda.run_inference()  # executes compilation (if first time) + runtime\n",
    "output = output_q.get()  # get last value from output queue"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 10: Data Postprocessing\n",
    "\n",
    "Data postprocessing is done to convert the model outputs into a readable / useful format. For NLP models, this usually means receiving the logit outputs from the model, extracting the top matching tokens, and then decoding the tokens into text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Data postprocessing\n",
    "answer_start = output[0].value().argmax().item()\n",
    "answer_end = output[1].value().argmax().item()\n",
    "answer = tokenizer.decode(input_tokens[\"input_ids\"][0, answer_start : answer_end + 1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 11: Print and evaluate outputs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Print outputs\n",
    "print(f\"Input context:\\n{context}\")\n",
    "print(f\"\\nInput question:\\n{question}\")\n",
    "print(f\"\\nOutput from model running on TTDevice:\\n{answer}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 12: Shutdown PyBuda"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pybuda.shutdown()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.0.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
